System Level Verification and Performance Analysis for FPGA Accelerated Computers

Authors

  • Zhimin Chen
  • Xu Guo
  • Ambuj Sinha
  • Patrick Schaumont
Abstract

Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24060, USA. E-mail: {chenzm, xuguo, ambujs87, schaum}@vt.edu.

As an accelerator, the Field Programmable Gate Array (FPGA) has great potential to assist a general-purpose processor in performance-critical tasks. A common integration approach is to configure the FPGA as a slave device to the general-purpose processor. In this case, the FPGA implements the equivalent of a function call in hardware for the general-purpose processor. For a given application, a designer identifies the performance-critical part and implements that part as a hardware unit on the FPGA fabric. As FPGA capacity increases, designers can replicate such a unit within one FPGA and obtain a multi-core system. The general-purpose processor handles the data (operands and results) for the FPGA and maintains system control for the overall application.

While this accelerator scenario is easy to understand, it creates two important challenges for designers. The first is system-level verification with a quick edit/compile/debug cycle. The second is performance analysis in a computing paradigm that mixes traditional computer architecture and FPGA-based processing. In this contribution, we share our experiences in system-level verification with hardware-software codesign. We also present our method to identify the system bottlenecks and to analyze the system performance.

FPGAs are programmed using Hardware Description Languages (HDL), and instruction-set processors are programmed in a sequential programming language. In the accelerator scenario described above, the hardware design paradigm common to FPGA programming must be integrated into the software design paradigm used by instruction-set processors.
In this contribution, we consider the use of hardware-software codesign for this issue. We combine RTL design on the FPGA with software programming on the general-purpose processor. This gives us precise control over the part of the application mapped into hardware, and over the design of the hardware-software interface between the FPGA and the general-purpose processor.

While it is always possible to use a compile-and-see-what-happens approach to FPGA accelerator design, this approach becomes inefficient for high-end FPGA designs because the design compilation time is too high. Furthermore, the observability of a design on an FPGA is low, and debugging it requires specialized infrastructure. We propose a solution, called host-based cosimulation, that cosimulates the application on the host processor with the RTL design of the accelerator. Host-based cosimulation can verify the RTL without going through design compilation, and it allows debugging based on simulation alone.

Besides system-level verification, we also address the performance analysis challenge. We find that such a system, in which the FPGA executes as a slave device, may exhibit three different bottlenecks, which can be attributed to computational limits, communication-bandwidth limits, and storage limits, respectively. We identify five design factors that play critical roles in determining the system performance, including the number of parallel units that can be implemented in an FPGA, the size of data sent to each unit in each ...
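The three bottleneck classes named in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the authors' method: the function name, every parameter, and the three rate formulas are assumptions made for illustration only. It assumes identical replicated units, a single host link, and an on-FPGA buffer that must hold a task's operands before a unit can start.

```python
def classify_bottleneck(n_units, compute_s_per_task, bytes_per_task,
                        link_bytes_per_s, buffer_bytes):
    """Return (limiting factor, tasks per second) for a hypothetical slave-FPGA setup.

    All parameters are illustrative assumptions, not quantities from the paper.
    """
    # Computation bound: every replicated unit is busy all the time.
    compute_rate = n_units / compute_s_per_task

    # Communication bound: the host link delivers at most this many
    # operand sets per second.
    comm_rate = link_bytes_per_s / bytes_per_task

    # Storage bound: only units whose operands fit in the on-FPGA buffer
    # can be active at once.
    fed_units = min(n_units, buffer_bytes // bytes_per_task)
    storage_rate = fed_units / compute_s_per_task

    rates = {
        "computation": compute_rate,
        "communication": comm_rate,
        "storage": storage_rate,
    }
    # The slowest stage sets the sustained throughput; on a tie the
    # first-listed factor is reported.
    limit = min(rates, key=rates.get)
    return limit, rates[limit]


if __name__ == "__main__":
    # 8 units, 2 s per task, 1 KiB per task, 2 KiB/s link, 1 MiB buffer
    print(classify_bottleneck(8, 2.0, 1024, 2048, 1 << 20))
```

In this toy model, replicating more units raises only the computation bound, so past a certain point the host link or the buffer becomes the limiting factor.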


Similar resources

Using Reconfigurable Supercomputers and C-to-Hardware Synthesis for CNN Emulation

The complexity of hardware design methodologies represents a significant difficulty for non-hardware-focused scientists working on CNN-based applications. An emerging generation of Electronic System Level (ESL) design tools is being developed, which allows software-hardware codesign and partitioning of complex algorithms from High Level Language (HLL) descriptions. These tools, together with High...


Accelerated BLAST Performance with Tera-BLAST™: a comparison of FPGA versus GPU and CPU BLAST implementations

A number of technologies have emerged for accelerating similarity search algorithms in bioinformatics, including the use of field programmable gate arrays (FPGA), graphics processing units (GPU), and clusters of standard multicore CPUs. Here we present Tera-BLAST™, an FPGA-accelerated implementation of the BLAST algorithm, and compare the performance to GPU-accelerated BLAST and the industry s...


SoC Design Environment with Automated Bus Architecture Generation for Rapid Prototyping with ISS

In SoC design, it is important that design and verification can be done easily and quickly. RT-level simulation is still a necessary verification method, but its use is limited by its low performance. We therefore propose an SoC verification environment in which hardware parts are accelerated in an FPGA and cores are modeled with an ISS. To connect ISS in high abstraction level with emulator...


FPGA-Accelerated Simulation of Computer Systems

To date, the most common form of computer system simulator is software-based and runs on standard computers. One promising approach to improve simulation performance is to apply hardware, specifically reconfigurable hardware in the form of field programmable gate arrays (FPGAs). This manuscript describes various approaches of using FPGAs to accelerate software-implemented simulation of compu...


Performance comparison of finite-difference modeling on Cell, FPGA and multi-core computers

How does the performance of Cell, field-programmable gate array (FPGA), and multi-core computers compare for finite-difference modeling of the acoustic wave equation? In this paper I answer this question by assessing implementations on each of these architectures. Results show that on average, 7.49, 5.01, and 3.74 GFLOPs were sustained, respectively, by the FPGA, quad-core, and Cell machines for...




Publication date: 2010